8356165: System.in in jshell replace supplementary characters with ?? #25079
👋 Welcome back jlahoda! A progress list of the required criteria for merging this PR into |
@lahodaj This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 241 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
@@ -977,7 +977,15 @@ public void perform(LineReaderImpl in) throws IOException {
public synchronized int readUserInput() throws IOException {
if (pendingBytes == null || pendingBytes.length <= pendingBytesPointer) {
char userChar = readUserInputChar();
pendingBytes = String.valueOf(userChar).getBytes();
StringBuilder dataToConvert = new StringBuilder();
Perhaps add the comment from the PR description here, for future readers:
[...] when the current character is a high surrogate, peek at the next character, and if it is a low surrogate, convert both the high and low surrogates to bytes together.
The (internal) API used in the implementation doesn't express that at first sight.
Thanks for adding comments.
if (pendingLine.length() > pendingLinePointer &&
Character.isLowSurrogate(pendingLine.charAt(pendingLinePointer))) {
dataToConvert.append(readUserInputChar());
}
How about combining readUserInputChar() with pendingLinePointer-- (applied only when the input is not a surrogate pair but just an isolated code unit)?
The pendingLinePointer-- branch is unlikely to be taken for normal input, other than in penetration tests.
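A minimal sketch of the read-then-step-back idea suggested above, for comparison with the peek-based version in the diff. The class StepBackSketch, its hasMore() and nextBytes() helpers, and the explicit UTF-8 charset are assumptions made for illustration; only the names pendingLine, pendingLinePointer and readUserInputChar() come from the PR. The sketch works on a fixed string and cannot fetch more input, so it guards the second read with a bounds check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the reader in the PR; not the JShell implementation.
final class StepBackSketch {
    private final String pendingLine;
    private int pendingLinePointer;

    StepBackSketch(String line) {
        this.pendingLine = line;
    }

    boolean hasMore() {
        return pendingLinePointer < pendingLine.length();
    }

    // Consumes one UTF-16 code unit from the pending line.
    char readUserInputChar() {
        return pendingLine.charAt(pendingLinePointer++);
    }

    // Returns the bytes for the next code point (or isolated code unit).
    byte[] nextBytes() {
        char first = readUserInputChar();
        if (Character.isHighSurrogate(first) && hasMore()) {
            char second = readUserInputChar();
            if (Character.isLowSurrogate(second)) {
                // Proper surrogate pair: encode both code units together.
                return new String(new char[] {first, second})
                        .getBytes(StandardCharsets.UTF_8);
            }
            // Isolated high surrogate: un-read the unit that was consumed.
            pendingLinePointer--;
        }
        return String.valueOf(first).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StepBackSketch reader = new StepBackSketch("\uD83D\uDE03a");
        while (reader.hasMore()) {
            System.out.println(reader.nextBytes().length + " byte(s)");
        }
    }
}
```

The step back only runs when a high surrogate is followed by something other than a low surrogate, which matches the remark that this path is rare for ordinary input.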
inputSink.write("new String(System.in.readNBytes(4))\n\uD83D\uDE03\n");
waitOutput(out, "\"\uD83D\uDE03\"");
I think the following is more robust:
- inputSink.write("new String(System.in.readNBytes(4))\n\uD83D\uDE03\n");
- waitOutput(out, "\"\uD83D\uDE03\"");
+ inputSink.write("new String(System.in.readNBytes(5))\n\uD83D\uDE031\n");
+ waitOutput(out, "\"\uD83D\uDE031\"");
I forgot to explain the context:
I missed the additional |
I trust you have considered the difference in EOL length between Windows and Unix.
Yes, the test handles both Unix and Windows EOL (that's the complicating factor).
I suspect I've misunderstood the argument for
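The thread above does not preserve the full rationale for the suggested test change, so the following only illustrates the byte arithmetic behind the two readNBytes arguments. It is a sketch under assumptions: the class ReadNBytesSketch and the helper readBack are hypothetical, the input is encoded as UTF-8, and a ByteArrayInputStream stands in for System.in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it does not involve JShell at all.
final class ReadNBytesSketch {
    // Encodes the input as UTF-8, reads back n bytes, and decodes them.
    static String readBack(String input, int n) {
        InputStream in = new ByteArrayInputStream(
                input.getBytes(StandardCharsets.UTF_8));
        try {
            return new String(in.readNBytes(n), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F603 occupies 4 bytes in UTF-8, so 4 bytes cover the emoji alone
        // and 5 bytes cover the emoji plus one following ASCII character.
        System.out.println(readBack("\uD83D\uDE03\n", 4));
        System.out.println(readBack("\uD83D\uDE031\n", 5));
    }
}
```

With the trailing 1, the test would also observe that the byte following the surrogate pair arrives intact, before any EOL bytes are reached.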
Looks good to me.
Looks good to me.
@@ -977,7 +977,20 @@ public void perform(LineReaderImpl in) throws IOException {
public synchronized int readUserInput() throws IOException {
if (pendingBytes == null || pendingBytes.length <= pendingBytesPointer) {
char userChar = readUserInputChar();
pendingBytes = String.valueOf(userChar).getBytes();
StringBuilder dataToConvert = new StringBuilder();
FWIW I think we can avoid using StringBuilder (and make the code more RAM-friendly):
char[] dataToConvert = { userChar, '\0' };
// if (...) {
// ...
// if (...) {
// ...
dataToConvert[1] = lowSurrogate;
// }
// ...
// }
// a low-surrogate code unit is never the null char, so '\0' can mark "no second code unit"
pendingBytes = (dataToConvert[1] != '\0' ? String.valueOf(dataToConvert) : String.valueOf(dataToConvert[0])).getBytes();
The next version of .NET is said to be able to allocate such a tiny array on the stack instead of the heap, but I don't know whether the JVM can do the same optimization.
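To make the comparison concrete, here is a hedged, self-contained sketch that puts the char[] idea next to a StringBuilder version and checks that both yield the same bytes. The class CharArraySketch, both method names, the (line, index) parameters and the explicit UTF-8 charset are assumptions for illustration; the code in the PR works on the reader's own fields instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison harness; neither method is the code from the PR.
final class CharArraySketch {
    // True when a low surrogate directly follows the high surrogate at index 'at'.
    private static boolean pairAt(String line, int at) {
        return Character.isHighSurrogate(line.charAt(at))
                && at + 1 < line.length()
                && Character.isLowSurrogate(line.charAt(at + 1));
    }

    static byte[] withStringBuilder(String line, int at) {
        StringBuilder dataToConvert = new StringBuilder();
        dataToConvert.append(line.charAt(at));
        if (pairAt(line, at)) {
            dataToConvert.append(line.charAt(at + 1));
        }
        return dataToConvert.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] withCharArray(String line, int at) {
        // '\0' marks "no second code unit"; a low surrogate is never '\0'.
        char[] dataToConvert = { line.charAt(at), '\0' };
        if (pairAt(line, at)) {
            dataToConvert[1] = line.charAt(at + 1);
        }
        String text = dataToConvert[1] != '\0'
                ? String.valueOf(dataToConvert)
                : String.valueOf(dataToConvert[0]);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String line = "a\uD83D\uDE03";
        for (int at : new int[] {0, 1}) {
            System.out.println(Arrays.equals(
                    withStringBuilder(line, at), withCharArray(line, at)));
        }
    }
}
```

Both variants allocate a String before encoding, so the difference is limited to the intermediate buffer.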
We could do something like that, although I wrote it the way I did mostly because that is more clearly correct. The current tests probably cover all the cases, though, so with a bit of work we could probably eliminate the (explicit) array completely.
Overall, in most places it is usually not necessary to be too clever: the VM can optimize and eliminate allocations if needed.
I'll leave it up to Adam and/or Christian whether they would prefer slightly more complex code with less (explicit/visible) allocation.
FWIW, we could do this:
lahodaj@6a07648
Readability of the code is my preference unless the performance is absolutely critical (not this case).
Better than my suggestion.
The current code using StringBuilder is not bad either, because passing a very long string to JShell seems like shooting ourselves in the foot anyway.
You can /integrate the current code with StringBuilder.
/integrate
Going to push as commit e961b13.
Your commit was automatically rebased without conflicts.
When reading from System.in in a JShell snippet, JShell first reads the whole line (getting a String), and then converts the characters from this String to bytes on demand. But it does not convert multi-surrogate code points correctly: it tries to convert each surrogate separately, which cannot work.

The proposal herein is to, when the current character is a high surrogate, peek at the next character, and if it is a low surrogate, convert both the high and low surrogates to bytes together.
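The difference described above can be reproduced outside JShell. The sketch below is an illustration under assumptions, not the JShell code: the class SurrogateBytesSketch and its two methods are hypothetical, and UTF-8 is passed explicitly, whereas the diff calls getBytes() without a charset argument. It relies on the documented behavior that String.getBytes(Charset) substitutes the charset's replacement bytes for malformed input such as a lone surrogate.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class; names are not taken from the JDK sources.
final class SurrogateBytesSketch {
    // Old behavior: every UTF-16 code unit is encoded on its own.
    static byte[] perCharBytes(String line) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < line.length(); i++) {
            byte[] b = String.valueOf(line.charAt(i))
                    .getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Proposed behavior: a high surrogate is encoded together with the
    // low surrogate that follows it, if there is one.
    static byte[] pairAwareBytes(String line) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pointer = 0;
        while (pointer < line.length()) {
            char userChar = line.charAt(pointer++);
            StringBuilder dataToConvert = new StringBuilder();
            dataToConvert.append(userChar);
            if (Character.isHighSurrogate(userChar)
                    && pointer < line.length()
                    && Character.isLowSurrogate(line.charAt(pointer))) {
                // Peek succeeded: consume the low surrogate as well.
                dataToConvert.append(line.charAt(pointer++));
            }
            byte[] b = dataToConvert.toString()
                    .getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE03"; // U+1F603, one supplementary code point
        System.out.println(Arrays.toString(perCharBytes(smiley)));
        System.out.println(Arrays.toString(pairAwareBytes(smiley)));
    }
}
```

Encoding each surrogate separately yields one replacement byte per code unit, which is where the ?? in the issue title comes from; encoding the pair together yields the four-byte UTF-8 sequence for the code point.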
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25079/head:pull/25079
$ git checkout pull/25079
Update a local copy of the PR:
$ git checkout pull/25079
$ git pull https://git.openjdk.org/jdk.git pull/25079/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25079
View PR using the GUI difftool:
$ git pr show -t 25079
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25079.diff
Using Webrev
Link to Webrev Comment